DESCRIPTION 
Signal Processing Method, and Video Signal Processor 

Technical Field 

The present invention relates to a signal processing method for detecting and 
analyzing a pattern reflecting a semantics on which a signal is based, and a video 
signal processor for detecting and analyzing a visual and/or audio pattern reflecting a 
semantics on which a video signal is based. 



Background Art 

It is often desired to search, for playback, a desired part of a video application 
composed of a large amount of different video data, such as a television program 
recorded in a video recorder, for example. 

As a typical one of the image extraction techniques to extract a desired visual 
content, there has been proposed a story board, which is a panel formed from a 
sequence of images defining a main scene in a video application. Namely, a story 
board is prepared by decomposing video data into so-called shots and displaying 
representative images of the respective shots. Most of the image extraction techniques 
automatically detect and extract shots from video data, as disclosed in "G. 
Ahanger and T. D. C. Little: A Survey of Technologies for Parsing and Indexing 
Digital Video, Journal of Visual Communication and Image Representation 7: 28-4, 
1996", for example. 

It should be noted that a typical half-hour TV program, for example, contains 
hundreds of shots. Therefore, with the above conventional image extraction technique 
of G. Ahanger and T. D. C. Little, the user has to examine a story board listing 
an enormous number of extracted shots. Understanding such a story board 
is a great burden to the user. Also, consider a dialogue scene in which, for example, 
two persons are talking. In the dialogue, the two persons are 
alternately shot by a camera each time either of them speaks. Therefore, many of the 
shots extracted by the conventional image extraction technique are redundant. The 
shots contain much useless information since they are at too low a level as objects from 
which a video structure is to be extracted. Thus, the conventional image extraction 
technique cannot be said to be convenient for extraction of such shots by the user. 

In addition to the above, further image extraction techniques have been 
proposed as disclosed in "A. Merlino, D. Morey and M. Maybury: Broadcast News 
Navigation Using Story Segmentation, Proceeding of ACM Multimedia 97, 1997" 
and the Japanese Unexamined Patent Publication No. 10-136297, for example. 
However, these techniques can only be used with highly specialized knowledge of 
limited genres of content such as news and football games. These conventional image 
extraction techniques can assure a good result when directed to such limited genres 
but are of no use for other genres. Such limitation of the 
techniques to special genres makes it difficult for them to come into wide 
use. 

Further, still another image extraction technique has been proposed, as 
disclosed in the United States Patent No. 5,708,767 for example. It extracts a 
so-called story unit. However, this conventional image extraction technique is not 
completely automated, and thus a user's intervention is required to determine 
which shots have the same content. Also, this technique needs a complicated 
computation for signal processing and is only applicable to video information. 

Furthermore, still another image extraction technique has been proposed as 
in the Japanese Unexamined Patent Publication No. 9-214879, for example, in which 
shots are identified by a combination of shot detection and silent period detection. 
However, this conventional technique can be used only when the silent period 
corresponds with a boundary between shots. 

Moreover, yet another image extraction technique has been proposed as 
disclosed in "H. Aoki, S. Shimotsuji and O. Hori: A Shot Classification Method to 
Select Effective Key-Frames for Video Browsing, IPSJ Human Interface SIG Notes, 
7:43-50, 1996" and the Japanese Unexamined Patent Publication No. 9-93588 for 
example, in which repeated similar shots are detected to reduce the redundancy of the 
depiction in a story board. However, this conventional image extraction technique is 
only applicable to visual information, not to audio information. 

Disclosure of the Invention 

Accordingly, the present invention has an object to overcome the above-mentioned 
drawbacks of the prior art by providing a signal processing method and 
video signal processor, which can extract a high-level video structure in a variety of 
video data. 

The above object can be attained by providing a signal processing method for 
detecting and analyzing a pattern reflecting the semantics of the content of a signal, 
the method including, according to the present invention, steps of: extracting, from a 
segment consisting of a sequence of consecutive frames forming together the signal, 
at least one feature which characterizes the properties of the segment; calculating, 
using the extracted feature, a criterion for measurement of a similarity between a pair 
of segments for every extracted feature and measuring a similarity between a pair of 
segments according to the similarity measurement criterion; and detecting, according 
to the feature and similarity measurement criterion, two of the segments, whose mutual 
time gap is within a predetermined temporal threshold and mutual dissimilarity is less 
than a predetermined dissimilarity threshold, and grouping the segments into a scene 
consisting of a sequence of temporally consecutive segments reflecting the semantics 
of the signal content. 

In the above signal processing method according to the present invention, 
similar segments in the signal are detected and grouped into a scene. 

Also the above object can be attained by providing a video signal processor for 
detecting and analyzing a visual and/or audio pattern reflecting the semantics of the 
content of a supplied video signal, the apparatus including according to the present 
invention: means for extracting, from a visual and/or audio segment consisting of a 
sequence of consecutive visual and/or audio frames forming together the video signal, 
at least one feature which characterizes the properties of the visual and/or audio 
segment; means for calculating, using the extracted feature, a criterion for 
measurement of a similarity between a pair of visual segments and/or audio segments 
for every extracted feature and measuring a similarity between a pair of visual 
segments and/or audio segments according to the similarity measurement criterion; 
and means for detecting, according to the feature and similarity measurement criterion, 
two of the visual segments and/or audio segments, whose mutual time gap is within a 
predetermined temporal threshold and mutual dissimilarity is less than a predetermined 
dissimilarity threshold, and grouping the visual segments and/or audio segments into 
a scene consisting of a sequence of temporally consecutive visual segments and/or 
audio segments reflecting the semantics of the video signal content. 

In the above video signal processor according to the present invention, similar 
visual segments and/or audio segments in the video signal are detected and grouped 
for output as a scene. 

Brief Description of the Drawings 

FIG. 1 explains the structure of a video data to which the present invention is 
applicable, using a video data model. 

FIG. 2 explains a scene. 

FIG. 3 is a block diagram of an embodiment of the video signal processor 
according to the present invention. 

FIG. 4 is a flow chart of a series of operations effected in detecting segments 
and grouping them into a scene in the video signal processor. 

FIG. 5 explains the sampling of dynamic features in the video signal processor. 

FIG. 6 explains the dissimilarity threshold. 

FIG. 7 explains the temporal threshold. 

FIG. 8 is a flow chart of a series of operations effected in grouping segments in 
the video signal processor. 

Best Mode for Carrying Out the Invention 

The embodiment of the present invention will further be described below with 
reference to the accompanying drawings. 

The embodiment of the present invention is a video signal processor in which 
a desired content is automatically detected and extracted from a recorded video data. 
Before going to the further description of the video signal processor, a video data to 
which the present invention is applicable will first be described. 

FIG. 1 shows a video data model having a hierarchy of three levels, namely 
frames, segments and scenes, to which the present invention is applicable. As seen, the 
video data model includes a sequence of frames at the lowest level. Also the video 
data model includes a sequence of consecutive segments at a level one step higher than 
the level of the frames. Further, the video data model includes scenes at the highest 
level. That is, a video data is formed from the scenes each consisting of the segments 
grouped together based on a meaningful relation between them. 
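
The three-level hierarchy can be pictured with a small data-structure sketch. 
This is only an illustration, not part of the disclosed apparatus: the class names 
Segment and Scene, their fields, and the frame indices in the example are all 
assumptions introduced here. 

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Segment:
    """A run of consecutive frames (hypothetical representation)."""
    first_frame: int          # index of the first frame, inclusive
    last_frame: int           # index of the last frame, inclusive
    kind: str = "visual"      # "visual" or "audio"

    @property
    def length(self) -> int:
        # Number of frames covered by the segment.
        return self.last_frame - self.first_frame + 1


@dataclass
class Scene:
    """A group of segments judged to belong together."""
    segments: List[Segment] = field(default_factory=list)

    def frame_span(self) -> Tuple[int, int]:
        # First and last frame covered by any segment of the scene.
        return (min(s.first_frame for s in self.segments),
                max(s.last_frame for s in self.segments))


# Frames are plain indices here; three segments form one scene.
shots = [Segment(0, 89), Segment(90, 149), Segment(150, 260)]
scene = Scene(shots)
```

Frames sit at the bottom as plain indices, segments reference frame ranges, and a 
scene holds the segments grouped into it. 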

The video data includes both visual information and audio information. That 
is, the frames in the video data include visual frames each being a single still image, 
and audio frames representing audio information having generally been sampled for 
a time as short as tens to hundreds of milliseconds. 

As in the video data model, each of the segments is comprised of a sequence of 
visual frames having consecutively been picked up by a single camera. Generally, the 
segment is called a "shot". The segments include visual segments and/or audio 
segments, and each segment is the fundamental unit of a video structure. Especially, 
the audio segments among these segments can be defined in many different manners 
as will be described below by way of example. First, in some cases, audio segments 
are bounded by periods of silence in a video data detected by a well-known method. 
Also, in some cases, each audio segment is formed from a 
sequence of audio frames classified in several categories such as speech, music, noise, 
silence, etc. as disclosed in "D. Kimber and L. Wilcox: Acoustic Segmentation for 
Audio Browsers, Xerox PARC Technical Report". Further, in other cases, the audio 
segments are determined based on an audio cut point which is a large variation of a 
certain feature from one to the other of two successive audio frames, as disclosed in 
"S. Pfeiffer, S. Fischer and E. Wolfgang: Automatic Audio Content Analysis, 
Proceeding of ACM Multimedia 96, Nov. 1996, pp21-30", for example. 

Further, to describe the content of a video data at a higher level based on its 
semantics, the scene is made up of a meaningful group of features of segments such 
as perceptual activities in the segments acquired by detecting visual segments or shots 
or audio segments. The scene is subjective and depends upon the content or genre of 
the video data. The scene referred to herein is a group of repeated patterns of visual 
or audio segments whose features are similar to each other. More specifically, in a 
scene of a dialogue between two persons for example, visual segments appear 
alternately each time one of them speaks as shown in FIG. 2. In a video data having 
such a repeated pattern, a sequence of visual segments A of one of the two speakers 
and a sequence of visual segments B of the other speaker are grouped into one scene. 
The repeated pattern has a close relation with a high-level meaningful structure in the 
video data, and represents a high-level meaningful block in the video data. 

Referring now to FIG. 3, there is schematically illustrated the video signal 
processor according to the present invention. The video signal processor is generally 
indicated with a reference 10. In the video signal processor 10, the features of 
segments in the video data are used to determine the inter-segment similarity, group 
these segments into a scene, and automatically extract the video structure of the scene. 
Thus, the video signal processor 10 is applicable to both visual and audio segments. 

As shown in FIG. 3, the video signal processor 10 includes a video segmentor 
11 to segment or divide an input video data stream into visual or audio segments or 
into both, a video segment memory 12 to store the segments of the video data, a visual 
feature extractor 13 to extract a feature for each visual segment, an audio feature 
extractor 14 to extract a feature for each audio segment, a segment feature memory 15 
to store the features of the visual and audio segments, a scene detector 16 in which the 
visual and audio segments are grouped into a scene, and a feature similarity 
measurement block 17 to determine a similarity between two segments. 

The video segmentor 11 is supplied with a video data stream consisting of 
visual and audio data in any one of various digital formats including compressed video 
formats such as Moving Picture Experts Group Phase 1 (MPEG1), Moving Picture 
Experts Group Phase 2 (MPEG2) and digital video (DV), and divides the video data 
into visual or audio segments or into both segments. When the input video data is in 
a compressed format, the video segmentor 11 can directly process the compressed 
video data without fully expanding it. Also, the video segmentor 
11 supplies the downstream video segment memory 12 with information segments 
resulting from the segmentation of the input video data. Further, the video segmentor 
11 supplies the information segments selectively to the downstream visual feature 
extractor 13 and audio feature extractor 14, depending upon whether the segments 
are visual or audio segments. 

The video segment memory 12 stores the information segments of video data 
supplied from the video segmentor 11. Also the video segment memory 12 supplies 
the information segments to the scene detector 16 upon query from the scene detector 
16. 

The visual feature extractor 13 extracts a feature for each visual segment 
resulting from segmentation of the video data by the video segmentor 11. The visual 
feature extractor 13 can process compressed video data without fully expanding it. 
It supplies the extracted feature of each visual segment to the downstream segment 
feature memory 15. 

The audio feature extractor 14 extracts a feature for each audio segment 
resulting from segmentation of the video data by the video segmentor 11. The audio 
feature extractor 14 can process compressed audio data without fully expanding it. 
It supplies the extracted feature of each audio segment to the downstream segment 
feature memory 15. 

The segment feature memory 15 stores the visual and audio segment features 
supplied from the visual and audio feature extractors 13 and 14, respectively. Upon 
query from the downstream feature similarity measurement block 17, the segment 
feature memory 15 supplies stored features and segments to the feature similarity 
measurement block 17. 

The scene detector 16 groups the visual and audio segments into a scene using 
the information segments stored in the video segment memory 12 and the similarity 
between a pair of segments. The scene detector 16 starts with each segment in a group 
to detect a repeated pattern of similar segments in a group of segments, and groups such 
segments into the same scene. The scene detector 16 groups together segments into 
a certain scene, gradually enlarges the group until all the segments are grouped, and 
finally produces a detected scene for output. Using the feature similarity measurement 
block 17, the scene detector 16 determines how similar two segments are to each other. 

The feature similarity measurement block 17 determines a similarity between 
two segments, and queries the segment feature memory 15 to retrieve the feature for 
a certain segment. 

Since repeated similar segments lying close to each other in time are generally 
a part of the same scene, the video signal processor 10 detects such segments and 
groups them to detect a scene. The video signal processor 10 detects a scene by 
effecting a series of operations as shown in FIG. 4. 

First at step S1 in FIG. 4, the video signal processor 10 divides a video data into 
visual or audio segments as will be described below. The video signal processor 10 
divides a video data supplied to the video segmentor 11 into visual or audio segments 
or possibly into both segments. The video segmenting method employed in the video 
signal processor 10 is not any special one. For example, the video signal processor 10 
segments a video data by the method disclosed in the previously mentioned "G. 
Ahanger and T. D. C. Little: A Survey of Technologies for Parsing and Indexing 
Digital Video, Journal of Visual Communication and Image Representation 7: 28-4, 
1996". This video segmenting method is well known in this field of art. The video 
signal processor 10 according to the present invention can employ any video 
segmenting method. 

Next at step S2, the video signal processor 10 extracts a feature. More 
specifically, the video signal processor 10 calculates a feature which characterizes the 
properties of a segment by means of the visual feature extractor 13 and audio feature 
extractor 14. In the video signal processor 10, for example, a time duration of each 
segment, video or visual features such as color histogram and texture feature, audio 
features such as frequency analysis result, level and pitch, activity determination 
result, etc. are calculated as applicable features. Of course, the video signal processor 
10 according to the present invention is not limited to these applicable features. 

Next at step S3, the video signal processor 10 measures a similarity between 
segments using their features. More specifically, the video signal processor 10 
measures a dissimilarity between segments by the feature similarity measurement 
block 17 and determines how similar two segments are to each other according to the 
feature similarity measurement criterion of the feature similarity measurement block 
17. Using the features having been extracted at step S2, the video signal processor 10 
calculates a criterion for measurement of dissimilarity. 

At step S4, the video signal processor 10 groups the segments. More 
particularly, using the dissimilarity measurement criteria calculated at step S3 and 
features extracted at step S2, the video signal processor 10 iteratively groups similar 
segments lying close to each other in time. Thus, the video signal processor 10 
provides a finally produced group as a detected scene. 

With the above series of operations, the video signal processor 10 can detect a 
scene from a video data. Therefore, using the above result, the user can summarize the 
content of the video data and quickly access points of interest in the video data. 
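
Steps S3 and S4 can be illustrated with a much-simplified sketch. It is not the 
grouping procedure of FIG. 8, only one possible reading of the two thresholds: 
when two segments start within the temporal threshold of each other and their 
dissimilarity is below the dissimilarity threshold, every segment between them is 
placed in the same scene. The function name, the scalar features and the threshold 
values are assumptions made for the example. 

```python
def group_into_scenes(features, start_times, dissim, d_thresh, t_thresh):
    """Return one scene label per segment (segments sorted by start time).

    Simplified rule: if segments i < j are similar (dissim < d_thresh) and
    start within t_thresh of each other, segments i..j share one scene.
    """
    n = len(features)
    reach = list(range(n))  # furthest later segment linked to each segment
    for i in range(n):
        for j in range(i + 1, n):
            if start_times[j] - start_times[i] > t_thresh:
                break  # later segments are even further away in time
            if dissim(features[i], features[j]) < d_thresh:
                reach[i] = j
    labels, scene, end = [], 0, -1
    for i in range(n):
        if i > 0 and i > end:
            scene += 1  # no earlier link spans this segment: new scene
        end = max(end, reach[i])
        labels.append(scene)
    return labels


# A-B-A-B dialogue followed by a C-D-C pattern, using scalar toy features.
toy_features = [0.0, 1.0, 0.05, 1.02, 5.0, 9.0, 5.1]
toy_starts = [0, 5, 10, 15, 20, 25, 30]
labels = group_into_scenes(toy_features, toy_starts,
                           lambda a, b: abs(a - b),
                           d_thresh=0.2, t_thresh=12)
```

In this toy input the alternating A and B segments end up in one scene and the 
C-D-C run in a second one, which mirrors the dialogue example of FIG. 2. 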

The operation of the video signal processor 10 at each of the steps shown in 
FIG. 4 will further be described below. 

First the video segmentation at step S1 will be discussed herebelow. The video 
signal processor 10 divides a video data supplied to the video segmentor 11 into visual 
or audio segments or into both segments if possible. Many techniques are available 
for automatic detection of a boundary between segments in a video data. As 
mentioned above, the video signal processor 10 according to the present invention is 
not limited to any special video segmenting method. On the other hand, the accuracy 
of scene detection in the video signal processor 10 substantially depends upon the 
accuracy of the video segmentation which is to be done before the scene detection. 
It should be noted that in the video signal processor 10, the scene detection can be 
done even with some error in the video segmentation. In this video signal processor 
10, excessive segment detection is preferable to insufficient detection in the video 
segmentation. Namely, even when segments are detected in excess, similar ones can 
be grouped as those included in the same scene. 
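
Since the segmenting method is left open, the following is only one common 
possibility, sketched under assumptions: a boundary is declared wherever the 
histogram difference between consecutive frames exceeds a threshold. The function 
name, the two-bin histograms and the threshold values are illustrative and are not 
taken from this description. 

```python
def detect_shot_boundaries(frame_histograms, cut_threshold):
    """Return the frame indices at which a new segment is assumed to start."""
    boundaries = []
    for i in range(1, len(frame_histograms)):
        previous, current = frame_histograms[i - 1], frame_histograms[i]
        # L1 difference between consecutive frame histograms.
        diff = sum(abs(a - b) for a, b in zip(previous, current))
        if diff > cut_threshold:
            boundaries.append(i)
    return boundaries


toy_histograms = [[1.0, 0.0], [1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.0, 1.0]]
strict = detect_shot_boundaries(toy_histograms, cut_threshold=1.0)
loose = detect_shot_boundaries(toy_histograms, cut_threshold=0.1)
```

A lower threshold yields more boundaries. In line with the remark above, such 
over-segmentation is the safer error, because similar neighbouring segments can 
still be merged into one scene later. 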

Next the feature detection at step S2 will be discussed herebelow. The features 
are attributes of segments, characterizing the properties of the segments and providing 
information according to which a similarity between different segments is measured. 
In the video signal processor 10, the visual and audio feature extractors 13 and 14 
calculate visual and audio features for each segment. However, the video signal 
processor 10 is not limited to any special features. The features considered to be 
effectively usable in the video signal processor 10 include visual feature, audio feature 
and visual-audio feature as will be described below. The requirement for these 
features usable in the video signal processor 10 is that they should be ones from which 
a dissimilarity can be determined. For a higher efficiency of signal processing, the 
video signal processor 10 simultaneously effects a feature extraction and video 
segmentation as the case may be. The features which will be described below meet the 
above requirement. 

The features include first one concerning an image (referred to as 
"visual feature" hereinafter). A visual segment is composed of successive visual 
frames. Therefore, an appropriate one of the visual frames can be extracted to 
characterize the content the visual segment depicts. 
Namely, a similarity of the appropriately extracted visual frames can be used as a 
similarity between visual segments. Thus, the visual feature is one of the 
important features usable in the video signal processor 10. The visual feature can 
represent by itself only static information. Using a method which will be described 
later, the video signal processor 10 can extract a dynamic feature of visual segments 
based on the visual feature. 

The visual features include many well-known ones. However, since it has been 
found that color feature (histogram) and video correlation, which will be described 
below, provide a good compromise between the cost and accuracy of calculation for 
the scene detection, the video signal processor 10 will use the color feature and video 
correlation as visual features. 

In the video signal processor 10, colors of images are important materials for 
determination of a similarity between two images. The use of a color histogram for 
determination of a similarity between images is well known as disclosed in, for 
example, "G. Ahanger and T. D. C. Little: A Survey of Technologies for Parsing and 
Indexing Digital Video, Journal of Visual Communication and Image Representation 
7: 28-4, 1996". It should be noted that the color histogram is acquired by dividing a 
three-dimensional space such as HSV, RGB or the like for example into n areas and 
calculating a relative ratio of frequency of appearance in each area of pixels of an 
image. Information thus acquired gives an n-dimensional vector. Also, a color 
histogram can be extracted directly from a compressed video data as disclosed in the 
United States Patent No. 5,708,767. 

The video signal processor 10 uses a 64-dimensional (= $2^{2 \cdot 3}$-dimensional) 
histogram vector acquired by sampling, at a rate of 2 bits per color channel, an original 
YUV color space in images forming a segment. 
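
A minimal sketch of such a 64-bin histogram follows. It assumes 8-bit Y, U and V 
samples and assumes that "2 bits per color channel" means keeping the two most 
significant bits of each sample; the function name and the bin ordering are 
likewise choices made for this example. 

```python
def yuv_histogram_64(pixels):
    """Normalized 64-bin histogram from (Y, U, V) tuples of 8-bit values."""
    hist = [0.0] * 64
    for y, u, v in pixels:
        # Keep the top 2 bits of each channel: 4 * 4 * 4 = 64 bins.
        bin_index = ((y >> 6) << 4) | ((u >> 6) << 2) | (v >> 6)
        hist[bin_index] += 1.0
    total = len(pixels)
    # Relative frequency per bin; an empty input yields an all-zero vector.
    return [count / total for count in hist] if total else hist


toy_pixels = [(0, 0, 0), (255, 255, 255), (255, 255, 255), (64, 128, 192)]
toy_hist = yuv_histogram_64(toy_pixels)
```

The result is the n-dimensional vector of relative frequencies described above, 
with n = 64. 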

Such a histogram represents the overall color tone of an image but includes no 
timing data. For this reason, a video correlation is calculated as another visual feature 
in the video signal processor 10. For the scene detection in the video signal processor 
10, mutual overlapping of a plurality of similar segments is an important index 
indicating that the segments form together a scene structure. For example, in a 
dialogue scene, the camera is moved alternately between two persons, to whichever of 
them is currently speaking. Usually, for shooting the same person again, the 
camera is moved back to nearly the same position where he or she was previously shot. 
Since it has been found that for detection of such a scene structure, a correlation based 
on reduced grayscale images is a good index for a similarity between segments, initial 
images are thinned and reduced to grayscale images each of M x N (both M and N 
may be small values; for example, M x N may be 8 x 8) in size and a video correlation 
is calculated using the reduced grayscale images in the video signal processor 10. That 
is, the reduced grayscale images are interpreted as an MN-dimensional feature vector. 
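
The reduction and the correlation can be sketched as follows. The description does 
not fix how the images are thinned or which correlation is used, so block averaging 
and the Pearson correlation coefficient are assumptions here, as are the function 
names and the synthetic 16 x 16 test image. 

```python
import math


def reduce_gray(image, m=8, n=8):
    """Average a 2-D grayscale image (list of rows) down to an m*n vector.

    Assumes the image has at least m rows and n columns.
    """
    rows, cols = len(image), len(image[0])
    out = []
    for bi in range(m):
        r0, r1 = bi * rows // m, (bi + 1) * rows // m
        for bj in range(n):
            c0, c1 = bj * cols // n, (bj + 1) * cols // n
            block = [image[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            out.append(sum(block) / len(block))
    return out


def correlation(a, b):
    """Pearson correlation of two equal-length vectors (0.0 if either is flat)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [x - mean_b for x in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(x * x for x in db))
    return sum(x * y for x, y in zip(da, db)) / denom if denom else 0.0


gradient = [[r * 16 + c for c in range(16)] for r in range(16)]
brighter = [[v + 10 for v in row] for row in gradient]
inverted = [[255 - v for v in row] for row in gradient]
vec = reduce_gray(gradient)
same_layout = correlation(vec, reduce_gray(brighter))
opposite_layout = correlation(vec, reduce_gray(inverted))
```

A uniform brightness change leaves the correlation at about 1, which is why the 
measure suits shots taken again from nearly the same camera position. 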

The features different from the above-mentioned visual feature concern a 
sound. This feature will be referred to as "audio feature" hereinafter. The audio 
feature can represent the content of an audio segment. In the video signal processor 
10, a frequency analysis, pitch, level, etc. may be used as audio features. These audio 
features are known from various documents. 

First, the video signal processor 10 can make a frequency analysis of a Fourier 
Transform component or the like to determine the distribution of frequency 
information in a single audio frame. For example, the video signal processor 10 can 
use FFT (Fast Fourier Transform) component, frequency histogram, power spectrum 
and other features. 

Also, the video signal processor 10 may use pitches such as a mean pitch and 
maximum pitch, and sound levels such as mean loudness and maximum loudness, as 
effective audio features for representation of audio segments. 
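
Two of these audio features can be sketched in a few lines. The sketch uses a 
direct discrete Fourier transform for brevity (an FFT would give the same values 
faster) and takes the RMS amplitude as a stand-in for loudness; both choices, the 
function names and the test tone are assumptions. Pitch estimation is omitted. 

```python
import cmath
import math


def power_spectrum(frame):
    """Power in each non-negative frequency bin of a real-valued audio frame."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(frame))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum


def level_features(frames):
    """Mean and maximum RMS level over the audio frames of one segment."""
    rms = [math.sqrt(sum(x * x for x in f) / len(f)) for f in frames]
    return sum(rms) / len(rms), max(rms)


# A pure tone completing 4 cycles in a 32-sample frame, then a silent frame.
tone = [math.sin(2 * math.pi * 4 * t / 32) for t in range(32)]
silence = [0.0] * 32
spectrum = power_spectrum(tone)
mean_level, max_level = level_features([tone, silence])
```

For the test tone the spectrum peaks in bin 4, and the segment-level mean and 
maximum levels summarize the per-frame values. 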

Further features are those common to visual and audio segments. They are 
neither any visual feature nor audio feature, but provide useful information for 
representation of features of segments included in a scene. The video signal processor 
10 uses a segment length and an activity, as common visual-audio features. 

As in the above, the video signal processor 10 can use a segment length as a 
common visual-audio feature. The segment length is a time length of a segment. 
Generally, a scene has a rhythm feature inherent to itself. The rhythm feature appears 
as a change of segment length in a scene. For example, short segments contiguous to 
each other with a short time between them represent a commercial program. On the 
other hand, segments included in a conversation or dialogue scene are longer than 
those in a commercial program. Also, segments associated with each other in a 
dialogue scene are similar to each other. The video signal processor 10 can use the 
segment length having such properties as a common visual-audio feature. 

Also, the video signal processor 10 can use an activity as a common visual-audio 
feature. The activity is an index indicating how dynamic or static the content 
of a segment feels. For example, if a segment visually feels dynamic, the activity 
indicates a rapidity with which a camera is moved along an object or with which an 
object being shot by the camera changes. 

The activity is indirectly calculated by measuring a mean value of inter-frame 
dissimilarity in a feature such as color histogram. A video activity V_F is given by the 
following equation (1): 

$$V_F = \frac{\sum_{i=b}^{f-1} d_F(i, i+1)}{f - b} \qquad (1)$$

where i and j are frames, F is a feature measured between the frames i and j, d_F(i, j) is 
a dissimilarity measurement criterion for the feature F, and b and f are the numbers of 
the first frame and last frame in one segment. 

More specifically, the video signal processor 10 can calculate the video activity 
V_F using the above-mentioned histogram for example. 
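
Read as a mean of consecutive-frame dissimilarities, equation (1) can be sketched 
as below. The L1 distance between histograms is an assumed choice of 
dissimilarity, and the function names and the two-bin toy histograms are 
illustrative only. 

```python
def l1_distance(h1, h2):
    """L1 distance between two equal-length histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def video_activity(frame_features, dissim=l1_distance):
    """Mean dissimilarity between consecutive frames of one segment."""
    pairs = len(frame_features) - 1  # number of consecutive frame pairs
    if pairs < 1:
        return 0.0  # a single-frame segment has no inter-frame change
    total = sum(dissim(frame_features[i], frame_features[i + 1])
                for i in range(pairs))
    return total / pairs


static_shot = [[0.5, 0.5]] * 4
dynamic_shot = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
static_activity = video_activity(static_shot)
dynamic_activity = video_activity(dynamic_shot)
```

A segment whose histograms never change scores zero, while one that flips between 
two color distributions on every frame scores the maximum L1 distance of 2. 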

The features including the above-mentioned visual features basically indicate 
static information of a segment as in the above. To accurately represent the feature 
of a segment, however, dynamic information has to be taken into consideration. For this 
reason, the video signal processor 10 represents dynamic information by a feature 
sampling method which will be described below: 

As shown in FIG. 5 for example, the video signal processor 10 extracts more 
than one static feature, starting at different time points in one segment. At this time, 
the video signal processor 10 determines the number of features to extract by keeping 
a balance between a highest fidelity of segment depiction and a minimum data 
redundancy. For example, when a certain image in the segment can be designated as 
a key frame in that segment, a histogram calculated from the key frame will be a 
feature to extract. 

Using the sampling method which will be described later, the video signal 
processor 10 determines which of the samples extractable as feature of an object 
segment are to be selected. 

Here, it will be considered that a certain sample is always selected at a 
predetermined time point, for example, at the last time point in a segment. In this case, 
samples from two arbitrary segments changing to black frames (fading) will be the same 
black frames, so that possibly no distinct features will be acquired. That is, the two 
selected frames will be determined to be extremely similar to each other whatever the 
image contents of such segments are. This problem will take place since the samples 
are not good central values. 

For this reason, the video signal processor 10 is adapted not to extract a feature 
at such a fixed point but to extract a statistically central value in an entire segment. 
Here, the general feature sampling method will be described concerning two cases that 
(1) a feature can be represented as a real-number n-dimensional vector and (2) only 
a dissimilarity measurement criterion can be used. It should be noted that best-known 
visual and audio features such as histogram, power spectrum, etc. are included in the 
features in the case (1). 

In the case (1), the number of samples is predetermined to be k and the video 
signal processor 10 automatically divides the features of an entire segment into k 
different groups by using the well-known k-means clustering method as disclosed in 
"L. Kaufman and P. J. Rousseeuw: Finding Groups in Data: An Introduction to 
Cluster Analysis, John Wiley and Sons, 1990". The video signal processor 10 selects, 
as a sample value, a group centroid or a sample approximate to the centroid from each 
of the k groups. The complexity of the operations in the video signal processor 10 
increases only linearly with the number of samples. 
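
Case (1) can be sketched with a plain k-means loop that returns, for each group, 
the frame whose feature lies nearest the centroid. The evenly spaced seeding, the 
fixed iteration count, the function names and the one-dimensional toy features are 
assumptions of this sketch, which also requires k to be no larger than the number 
of frames. 

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_samples(features, k, iterations=20):
    """Pick k representative frame indices: the frame nearest each centroid."""
    step = len(features) / k
    # Seed the centroids with evenly spaced frames of the segment.
    centroids = [list(features[int(i * step)]) for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for f in features:
            nearest = min(range(k), key=lambda c: sq_dist(f, centroids[c]))
            groups[nearest].append(f)
        for c, members in enumerate(groups):
            if members:  # an empty group keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    picks = []
    for c in range(k):
        picks.append(min(range(len(features)),
                         key=lambda i: sq_dist(features[i], centroids[c])))
    return sorted(picks)


toy_frames = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
picked = kmeans_samples(toy_frames, k=2)
```

Choosing the frame nearest each centroid, rather than a frame at a fixed position 
such as the last one, avoids the fade-to-black problem described above. 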

On the other hand, in the case (2), the video signal processor 10 forms the k 
groups by the use of the k-medoids algorithm also disclosed in "L. Kaufman 
and P. J. Rousseeuw: Finding Groups in Data: An Introduction to Cluster Analysis, 
John Wiley and Sons, 1990". The video signal processor 10 uses, as a sample value, 
the group medoid for each of the k groups. 
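
Case (2) needs nothing but pairwise dissimilarities. The sketch below uses a 
simple alternating assign-and-update variant of k-medoids, not necessarily the 
exact procedure of the cited book; the seeding, the function names and the use of 
Hamming distance on short strings as a stand-in for non-vector features are 
assumptions. 

```python
def kmedoids_samples(items, k, dissim, iterations=20):
    """Pick k medoid indices using only pairwise dissimilarities."""
    n = len(items)
    d = [[dissim(items[i], items[j]) for j in range(n)] for i in range(n)]
    medoids = [int(i * n / k) for i in range(k)]  # evenly spaced seeds
    for _ in range(iterations):
        # Assign every item to its nearest current medoid.
        groups = [[] for _ in range(k)]
        for i in range(n):
            groups[min(range(k), key=lambda c: d[i][medoids[c]])].append(i)
        # The new medoid of a group minimizes total dissimilarity within it.
        updated = [min(g, key=lambda m: sum(d[m][j] for j in g)) if g else medoids[c]
                   for c, g in enumerate(groups)]
        if updated == medoids:
            break
        medoids = updated
    return sorted(medoids)


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


toy_items = ["aaaa", "aaab", "aabb", "zzzz", "zzzy", "zzyy"]
medoid_indices = kmedoids_samples(toy_items, k=2, dissim=hamming)
```

Because a medoid is always one of the original items, no averaging of features is 
required, which is what makes the approach usable when only a dissimilarity 
measurement criterion is available. 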

It should be noted that in the video signal processor 10, the method for 
establishing the dissimilarity measurement criterion for features representing extracted 
dynamic features is based on the dissimilarity measurement criterion for the static 
features on which they are based, as will further be described later. 

Thus, the video signal processor 10 can extract a plurality of static features to 
represent a dynamic feature using the plurality of static features. 

As in the above, the video signal processor 10 can extract various features. 
However, each of such features is generally insufficient for representation, by itself, 
of a segment feature. For this reason, the video signal processor 10 can select a set of 
mutually complementary features by combining these different features. For example, 
the video signal processor 10 can provide more information than that of each feature 
by combining the above-mentioned color histogram and image correlation with each 
other. 

Next, the measurement of similarity between segments, in which the features 
acquired at step S3 in FIG. 4 are used, will be described herebelow. Using the 
dissimilarity measurement criterion being a function to calculate a real-number value 
with which it is determined how dissimilar two features are to each other, the video 
signal processor 10 measures a dissimilarity between the segments by means of the 
feature similarity measurement block 17. When the dissimilarity measurement 
criterion is small, it indicates that two features are similar to each other. If the 
criterion is large, it indicates that the two features are not similar to each other. The 
function for calculation of the dissimilarity between the two segments S_1 and S_2 
concerning the feature F is defined as dissimilarity measurement criterion d_F(S_1, S_2). 
This function has to meet the relation given by the equations (2) below. 

d_F(S_1, S_2) = 0 (when S_1 = S_2) 

d_F(S_1, S_2) \geq 0 (for all S_1 and S_2)    (2) 

d_F(S_1, S_2) = d_F(S_2, S_1) (for all S_1 and S_2) 

It should be noted that some of the dissimilarity measurement criteria are only 
applicable to specific features. However, as disclosed in "G. Ahanger and T. D. C. 
Little: A Survey of Technologies for Parsing and Indexing Digital Video, Journal of 
Visual Communication and Image Representation 7: 28-4, 1996" and "L. Kaufman 
and P. J. Rousseeuw: Finding Groups in Data: An Introduction to Cluster Analysis, 
John Wiley and Sons, 1990", many dissimilarity measurement criteria are generally 
applicable to the measurement of a similarity between features represented as points 
in an n-dimensional space. These criteria include the Euclidean distance, inner product, 
L1 distance, etc. Since, of these criteria, the L1 distance acts effectively on various 
features including the histogram, image correlation, etc., the video signal processor 10 
adopts the L1 distance. On the assumption that two n-dimensional vectors 
are A and B, the L1 distance d_{L1}(A, B) is given by the following equation (3): 

d_{L1}(A, B) = \sum_{i=1}^{n} |A_i - B_i|    (3) 

where the subscript i indicates the i-th element of each of the n-dimensional vectors 
A and B. 
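
As a small illustration of the L1 distance, the following sketch sums the absolute element-wise differences of two vectors of equal dimension. The function name `l1_distance` and the example values are assumptions made only for this illustration.

```python
def l1_distance(a, b):
    """L1 distance: sum of absolute differences between matching elements."""
    if len(a) != len(b):
        raise ValueError("both vectors must have the same dimension n")
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical 3-bin histograms given as raw bin counts.
hist_a = [1, 2, 3]
hist_b = [3, 2, 0]
distance = l1_distance(hist_a, hist_b)  # |1-3| + |2-2| + |3-0|
```

The function is zero for identical vectors, never negative and symmetric in its arguments, so it meets the relations required of a dissimilarity measurement criterion.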

As mentioned above, the video signal processor 10 extracts, as features 
representing dynamic features, static features at different time points in a segment. 
Then, to determine a similarity between two extracted dynamic features, the video 
signal processor 10 uses, as a criterion for determination of a dissimilarity, a criterion 
for measurement of a dissimilarity between the static features on which the similarity 
measurement criterion is based. In many cases, the dissimilarity measurement 
criterion for the dynamic features should most advantageously be established using a 
dissimilarity between a pair of static features selected from each dynamic feature and 
most similar to each other. In this case, the criterion for measurement of a 
dissimilarity between two extracted dynamic features SF_1 and SF_2 is given by the 
following equation (4): 

d_S(SF_1, SF_2) = \min_{F_1 \in SF_1, F_2 \in SF_2} d_F(F_1, F_2)    (4) 

The function d_F(F_1, F_2) in the equation (4) above indicates a criterion for 
measurement of a dissimilarity between the static features F on which the equation (4) 
is based. It should be noted that a maximum or mean value of the dissimilarity 
between features may be taken instead of the minimum value as the case may be. 
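
The comparison of two dynamic features can be sketched as follows, assuming that each dynamic feature is held as a list of static feature vectors sampled at different time points. The names `dynamic_dissimilarity` and `combine`, and the use of the L1 distance as the static criterion, are illustrative assumptions.

```python
from statistics import mean

def l1(a, b):
    """Static-feature dissimilarity used for the illustration (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def dynamic_dissimilarity(sf1, sf2, d_f=l1, combine=min):
    """Dissimilarity between two dynamic features.

    sf1, sf2: lists of static feature vectors.
    combine:  min for the most similar pair; max or mean are the
              alternatives mentioned in the text.
    """
    return combine([d_f(f1, f2) for f1 in sf1 for f2 in sf2])

sf_a = [[0, 0], [10, 10]]
sf_b = [[1, 0], [20, 20]]
d_min = dynamic_dissimilarity(sf_a, sf_b)                # closest pair
d_max = dynamic_dissimilarity(sf_a, sf_b, combine=max)   # farthest pair
d_mean = dynamic_dissimilarity(sf_a, sf_b, combine=mean)
```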

In the video signal processor 10, only one feature is insufficient to determine 
a similarity between segments, and so in many cases it is necessary to combine 
information derived from many features for the same segment. A solution for this 
problem is to calculate a dissimilarity based on various features as a weighted 
combination of the respective features. That is, when there are available k features 
F_1, F_2, ..., F_k, the video signal processor 10 uses a dissimilarity measurement 
criterion d_F(S_1, S_2) for combined features. The criterion is given by the following 
equation (5): 

d_F(S_1, S_2) = \sum_{i=1}^{k} w_i d_{F_i}(S_1, S_2)    (5) 

where {w_i} are weighting factors satisfying \sum_i w_i = 1. 
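
A minimal sketch of the weighted combination follows. It assumes, purely for illustration, that a segment is held as a dictionary of named feature vectors and that each per-feature criterion is a function of two segments; the keys `"hist"` and `"corr"` and the function names are hypothetical.

```python
def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def combined_dissimilarity(seg1, seg2, criteria, weights):
    """Weighted sum of per-feature dissimilarities; weights must sum to 1."""
    if len(criteria) != len(weights):
        raise ValueError("one weight is needed per criterion")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weighting factors must sum to 1")
    return sum(w * d(seg1, seg2) for d, w in zip(criteria, weights))

# Hypothetical per-feature criteria built on the L1 distance.
def d_hist(s1, s2):
    return l1(s1["hist"], s2["hist"])

def d_corr(s1, s2):
    return l1(s1["corr"], s2["corr"])

seg_1 = {"hist": [1, 0], "corr": [4]}
seg_2 = {"hist": [0, 1], "corr": [0]}
d_combined = combined_dissimilarity(seg_1, seg_2, [d_hist, d_corr], [0.5, 0.5])
```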

As in the above, the video signal processor 10 can calculate a dissimilarity 
measurement criterion using features having been extracted at step S2 in FIG. 4 to 
determine a similarity between segments in consideration. 

Next, the segment grouping at step S4 in FIG. 4 will be described herebelow. 
Using the dissimilarity measurement criterion and extracted features, the video signal 
processor 10 repeatedly groups similar segments lying close to each other in time, and 
outputs a finally produced group as a detected scene. 

When detecting a scene by grouping segments, the video signal processor 10 
effects two basic operations. One of the operations is to detect groups of similar 
segments lying close to each other in time. Most of the groups thus acquired will be 
a part of the same scene. The other operation effected in the video signal processor 
10 is to combine concurrent segment groups together. The video signal processor 10 
starts these operations with independent segments, and repeats them. Then the video 
signal processor 10 builds progressively larger groups of segments and outputs the 
finally produced groups as a set of scenes. 

To control these operations, the video signal processor 10 is controlled under 
the following two constraints. 

Under one of the two constraints, the video signal processor 10 has to adopt a 
dissimilarity threshold δ_sim to determine whether two similar segments belong to the 
same scene. As shown in FIG. 6 for example, the video signal processor 10 judges 
whether one of the segments is similar or not similar to the other. 

It should be noted that the video signal processor 10 may be adapted to have the 
dissimilarity threshold δ_sim set by the user or to set it automatically as will be 
described later. 

Under the second constraint, the video signal processor 10 has to adopt a 
temporal threshold T as a maximum interval between two segments on the time base, 
based on which the two segments can be considered to be included in the same scene. 
As shown in FIG. 7 for example, the video signal processor 10 puts, into the same 
scene, two similar segments A and B lying close to each other within the temporal 
threshold T but not two segments B and C similar to each other but having between 
them a time gap not within the temporal threshold T. Thus, because of the constraint 
by the temporal threshold T, the video signal processor 10 will not erroneously put into 
the same scene two segments similar to each other but largely apart from each other 
on the time base. 

Since it has been found that a time for 6 to 8 shots set as the temporal threshold 
T would generally give a good result, the video signal processor 10 uses the temporal 
threshold T for 6 to 8 shots in principle. 

It is assumed herein that to acquire a group of similar segments, the video signal 
processor 10 adopts the hierarchical clustering method disclosed in "L. Kaufman and 
P. J. Rousseeuw: Finding Groups in Data: An Introduction to Cluster Analysis, John 
Wiley and Sons, 1990". In this algorithm, a criterion d_C(C_1, C_2) for determination of 
a dissimilarity between two clusters C_1, C_2 is defined as a minimum dissimilarity 
between elements included in each cluster. It is given by the following equation (6): 

d_C(C_1, C_2) = \min_{S_1 \in C_1, S_2 \in C_2} d_F(S_1, S_2)    (6) 

It should be noted that in the video signal processor 10, the minimum function 
expressed by the equation (6) can easily be replaced with a maximum function or mean 
function. 

First at step S11 in FIG. 8, the video signal processor 10 initializes a variable 
N to the number of segments in the initial state. The variable N always indicates the 
number of groups currently detected. 

Next at step S12, the video signal processor 10 generates a set of clusters. In 
the initial state, the video signal processor 10 takes N segments as different from each 
other. That is, there exist N clusters in the initial state. Each of the clusters has 
features indicating a start time and end time represented by C^{start} and C^{end}, respectively. 
Elements included in each cluster are managed as a list in which they are arranged in 
order based on the start time C^{start}. 

Next at step S13, the video signal processor 10 initializes a variable t to 1. At 
step S14, the video signal processor 10 judges whether the variable t is larger than the 
temporal threshold T. If the video signal processor 10 determines that the variable t 
is larger than the temporal threshold T, it will go to step S23. When it determines that 
the variable t is smaller than the temporal threshold T, it will go to step S15. Since the 
variable t is 1, however, the video signal processor 10 will go to step S15. 

At step S15, the video signal processor 10 calculates the dissimilarity 
measurement criterion d_C to detect two of the N clusters that are the most similar to 
each other. Since the variable t is 1, however, the video signal processor 10 will 
calculate the dissimilarity measurement criterion d_C between adjacent clusters to detect 
among the adjacent clusters a pair of clusters that are the most similar to each other. 

An approach to detect two clusters which are the most similar to each other may 
be to acquire all possible pairs of object clusters. However, since the variable t 
indicating a time interval between object clusters is given in segments and the clusters 
are arranged in the temporal order, the video signal processor 10 only needs to calculate 
the dissimilarity between a certain cluster and the t clusters before and after it. 

The two clusters thus detected are defined as C_i and C_j, respectively, and a 
dissimilarity between the clusters C_i and C_j is defined as d_ij. 

At step S16, the video signal processor 10 will judge whether the dissimilarity 
d_ij is larger than the dissimilarity threshold δ_sim. When the dissimilarity d_ij is judged 
larger than the dissimilarity threshold δ_sim, the video signal processor 10 will go to step 
S21. If the dissimilarity d_ij is judged smaller than the dissimilarity threshold δ_sim, 
the video signal processor 10 will go to step S17. It is assumed here that the 
dissimilarity d_ij is smaller than the dissimilarity threshold δ_sim. 

At step S17, the video signal processor 10 will merge the cluster C_j into the 
cluster C_i. That is, the video signal processor 10 will add to the cluster C_i all the 
elements in the cluster C_j. 

Next at step S18, the video signal processor 10 will remove the cluster C_j from 
the set of clusters. It should be noted that if the start time C_i^{start} changes due to the 
combination of the two clusters C_i and C_j, the video signal processor 10 will rearrange 
the elements in the set of clusters based on the start time C_i^{start}. 

Next at step S19, the video signal processor 10 will subtract 1 from the variable 
N. 

At step S20, the video signal processor 10 will judge whether the variable N is 
1 or not. If the variable N is judged to be 1, the video signal processor 10 will go to 
step S23. When the video signal processor 10 determines that the variable N is not 1, 
it will go to step S15. It is assumed here that the variable N is not 1. 

Thus, at step S15, the video signal processor 10 will calculate the dissimilarity 
measurement criterion d_C again to detect the two clusters most similar to each other. 
Since the variable t is 1, the video signal processor 10 will calculate the criterion d_C for 
determination of the dissimilarity between adjacent clusters to detect a pair of clusters 
that are the most similar to each other. 

Next at step S16, the video signal processor 10 will judge whether the 
dissimilarity d_ij is larger than the dissimilarity threshold δ_sim. It is also assumed here 
that the dissimilarity d_ij is smaller than the dissimilarity threshold δ_sim. 

The video signal processor 10 will effect the operations at steps S17 to S20. 

When as a result of the repetition of the above operations and subtraction of 1 
from the variable N, it is determined at step S20 that the variable N is 1, the video 
signal processor 10 will go to step S23 where it will combine together clusters each 
including a single segment. Finally, the video signal processor 10 terminates the series 
of operations by grouping all segments into one cluster as in the above. 

If the video signal processor 10 determines at step S16 that the dissimilarity d_ij 
is larger than the dissimilarity threshold δ_sim, it will go to step S21 where it will 
repeatedly combine clusters which concurrently exist. Namely, if the time interval 
between C_i^{start} and C_i^{end} of the cluster C_i is concurrent with that between 
C_j^{start} and C_j^{end} of the cluster C_j, the two clusters C_i and C_j overlap each 
other on the time base. Thus, the video signal processor 10 can arrange the clusters in 
a set based on the start time C_i^{start} of the cluster set to detect concurrent clusters 
and combine the clusters together. 

At step S22, the video signal processor 10 will add 1 to the variable t which will 
thus be t = 2, and go to step S14 where it will judge whether the variable t is larger 
than the temporal threshold T. It is also assumed here that the variable t is smaller 
than the temporal threshold T and the video signal processor 10 will go to step S15. 

At step S15, the video signal processor 10 will calculate the dissimilarity 
measurement criterion d_C and detect the two of a plurality of clusters existing currently 
that are the most similar to each other. However, since the variable t is 2, the video 
signal processor 10 calculates the criterion d_C for determination of the dissimilarity 
between adjacent clusters as well as between every other cluster to detect a pair of 
clusters the most similar to each other. 

Then at step S16, the video signal processor 10 judges whether the dissimilarity 
d_ij between the adjacent or every other clusters C_i and C_j is larger than the 
dissimilarity threshold δ_sim. It is assumed here that the dissimilarity d_ij is smaller than 
the dissimilarity threshold δ_sim. After effecting the operations at steps S21 and S22, the 
video signal processor 10 adds 1 to the variable t, which will thus be t = 3, and will 
move to step S14 and subsequent steps. When the variable t is 3, the video signal 
processor 10 will calculate, at step S15, the criterion d_C for determination of the 
dissimilarity between clusters up to three positions apart, that is, with up to two 
clusters between them, and detect a pair of clusters which are the most similar to each 
other. 

When as a result of the repetition of the above operations and addition of 1 to 
the variable t, it is determined at step S14 that the variable t is larger than the temporal 
threshold T, the video signal processor 10 will go to step S23 where it will combine 
clusters each including a single segment. That is, the video signal processor 10 will 
take discrete clusters as ones each including a single segment. If there exists a 
sequence of such clusters, the video signal processor 10 will combine them together. 
This process combines together segments having no relation in similarity with any 
adjacent scene. However, it should be noted that the video signal processor 10 does not 
always have to effect this process. 

With this series of operations, the video signal processor 10 can gather the 
plurality of clusters and generate a scene to be detected. 
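
The flow of steps S11 to S23 can be summarized in a short sketch. It rests on several assumptions made only for illustration: segments are identified by their temporal index, a cluster is a sorted list of such indices so that its first and last entries stand for its start and end, the cluster dissimilarity is the minimum pairwise dissimilarity, and a dissimilarity exactly equal to the threshold is treated as similar, a case the description leaves open. The names `detect_scenes`, `merge_concurrent` and `merge_singletons` are hypothetical.

```python
def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def merge_concurrent(clusters):
    """Step S21: combine clusters whose [start, end] index ranges overlap."""
    merged = []
    for c in sorted(clusters, key=lambda c: c[0]):
        if merged and c[0] <= merged[-1][-1]:
            merged[-1] = sorted(merged[-1] + c)
        else:
            merged.append(list(c))
    return merged

def merge_singletons(clusters):
    """Step S23 (optional): join runs of consecutive single-segment clusters."""
    out, prev_single = [], False
    for c in clusters:
        single = len(c) == 1
        if single and prev_single:
            out[-1] = out[-1] + c
        else:
            out.append(list(c))
        prev_single = single
    return out

def detect_scenes(features, d_f, delta_sim, T):
    def d_c(c1, c2):  # minimum pairwise dissimilarity between two clusters
        return min(d_f(features[p], features[q]) for p in c1 for q in c2)

    clusters = [[i] for i in range(len(features))]       # S11, S12
    t = 1                                                # S13
    while t <= T and len(clusters) > 1:                  # S14, S20
        # S15: most similar pair among clusters at most t positions apart.
        d_ij, i, j = min(
            (d_c(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, min(a + t, len(clusters) - 1) + 1)
        )
        if d_ij > delta_sim:                             # S16
            clusters = merge_concurrent(clusters)        # S21
            t += 1                                       # S22
        else:
            clusters[i] = sorted(clusters[i] + clusters[j])  # S17
            del clusters[j]                                  # S18, S19
            clusters.sort(key=lambda c: c[0])
    return merge_singletons(clusters)                    # S23

# Hypothetical dialogue A B A B followed by two shots of another scene.
shots = [[0], [10], [0], [10], [50], [50]]
scenes = detect_scenes(shots, l1, delta_sim=1, T=2)
```

In this example the alternating shots only become mergeable once t reaches 2, after which the two interleaved clusters overlap on the time base and are combined into one scene at step S21.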

It should be noted that the video signal processor 10 may be adapted to have the 
dissimilarity threshold δ_sim set by the user or to determine it automatically, as has 
previously been described. However, when a fixed value is used as the dissimilarity 
threshold δ_sim, the optimum value of the dissimilarity threshold δ_sim will depend upon 
the content of a video data. For example, for a video data whose content is variable, 
the dissimilarity threshold δ_sim has to be set to a high value. On the other hand, for 
a video data having a less-variable content, the dissimilarity threshold δ_sim has to be set 
to a low value. Generally, when the dissimilarity threshold δ_sim is high, fewer scenes will 
be detected. On the other hand, when the dissimilarity threshold δ_sim is low, more scenes 
will be detected. 

Thus, an optimum dissimilarity threshold δ_sim has to be determined since the 
performance of the video signal processor 10 depends greatly upon the dissimilarity 
threshold δ_sim. Therefore, when the video signal processor 10 is adapted to have the 
dissimilarity threshold δ_sim set by the user, the above has to be taken in consideration. On 
the other hand, the video signal processor 10 may be adapted to automatically set an 
effective dissimilarity threshold δ_sim by using any of methods which will be described 
below. 

One of the methods will be described by way of example. Namely, the video 
signal processor 10 can acquire a dissimilarity threshold δ_sim by using a statistic 
quantity such as mean value and median in distribution of the dissimilarity between 
n(n-1)/2 segment pairs. Assume here that the mean value and standard deviation of 
the dissimilarity between all segment pairs are μ and σ, respectively. In this case, the 
dissimilarity threshold δ_sim can be represented by a form of aμ + bσ where a and b are 
constants. It has been found that setting of the constants a and b to 0.5 and 0.1, 
respectively, will assure a good result. 

The video signal processor 10 does not have to determine the dissimilarity between 
all pairs of segments but should measure a dissimilarity by selecting at random from 
a set of all segment pairs a sufficient number of segment pairs to provide nearly real 
mean value μ and standard deviation σ. Using the mean value μ and standard 
deviation σ thus determined, the video signal processor 10 can automatically determine 
an appropriate dissimilarity threshold δ_sim. 
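
The automatic threshold can be sketched as below. The function name `auto_threshold`, the `max_pairs` parameter and the use of the population standard deviation are assumptions for illustration; the description does not state which form of standard deviation is meant. For simplicity the sketch enumerates the index pairs before sampling, so only the dissimilarity evaluations are reduced, not the enumeration itself.

```python
import random
from itertools import combinations
from statistics import mean, pstdev

def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def auto_threshold(features, d_f, a=0.5, b=0.1, max_pairs=None, seed=0):
    """Estimate the dissimilarity threshold as a * mean + b * std deviation."""
    pairs = list(combinations(range(len(features)), 2))  # n(n-1)/2 pairs
    if max_pairs is not None and len(pairs) > max_pairs:
        # Measure only a random subset of the segment pairs.
        pairs = random.Random(seed).sample(pairs, max_pairs)
    distances = [d_f(features[i], features[j]) for i, j in pairs]
    return a * mean(distances) + b * pstdev(distances)

# Hypothetical one-dimensional segment features.
segment_features = [[0], [2], [4]]
threshold = auto_threshold(segment_features, l1)
```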

In the foregoing, the use of a single dissimilarity measurement criterion in the 
video signal processor 10 has been described. In addition, the video signal processor 
10 can use a weighting function to combine a variety of dissimilarity measurement criteria 
for different types of features in order to judge whether segments in pairs belong to the 
same group, as having previously been described. The features can only be weighted 
after trial and error, and when the features are different in type from each other, it is 
usually difficult to appropriately weight them. However, using a color histogram and 
texture feature, for example, in combination, the video signal processor 10 can detect 
possible scenes for these features and synthesize a single scene structure from the 
structures of the detected scenes, thereby permitting detection of a scene in which both 
the color histogram and texture feature are taken into account. Each of the results of scene 
detection for the features will be referred to as "scene layer" hereinafter. For example, 
when a color histogram and segment length are used as features, the video signal 
processor 10 can detect scenes which are based on the features, respectively, to 
provide a scene layer for the color histogram and one for the segment length. The 
video signal processor 10 can combine these scene layers into a single scene structure. 
Generally, information from video and audio domains cannot be combined 
in principle. Using a similar method to that for combining structures based on features 
different in quality from each other, the video signal processor 10 can combine into 
a single scene structure scene layers obtainable based on information from video and 
audio domains. 

Such an algorithm will be described herebelow. It is assumed here that there 
are k features F_1, F_2, ..., F_k, each representing one similarity criterion and there are 
available a dissimilarity measurement criterion d_F^i, dissimilarity threshold δ_sim^i and a 
temporal threshold T^i correspondingly to the features F_i. Using the dissimilarity 
measurement criterion d_F^i, dissimilarity threshold δ_sim^i and temporal threshold T^i for 
the features F_i, the video signal processor 10 detects a set of scene layers X_i = {X_i^j}. 
For example, the video signal processor 10 detects divisional scene layers for video 
and audio information, respectively, and generates two independent scene layers X_i = 
{X_i^j} (i = 1, 2) for the video and audio information, respectively. 

The video signal processor 10 has to determine how to combine scene 
boundaries for combination of different scene layers into a single scene structure. 
The scene boundaries do not always match one another. It is assumed here that for the 
scene layers, boundary points represented by a sequence of times indicating the scene 
boundaries are t_i1, t_i2, ..., t_i|X_i|. The video signal processor 10 first selects a certain 
scene layer to be a basis for alignment of the boundary points in order to combine 
various scene layers into a single group. Then, the video signal processor 10 
determines for each of the boundary points t_i1, t_i2, ..., t_i|X_i| whether the boundaries 
of the other scene layers are those in the scene structure produced by finally combining 
the scene layers. 

It is assumed here that the logical function indicating whether the i-th scene 
layer X_i has a boundary point near a time t is B_i(t). The term "near" varies depending 
upon the situation of the scene layer X_i, and it is for example 0.5 sec for combining 
scene layers based on video and audio information, respectively. 

The video signal processor 10 calculates the logical function B_i(t_j) for each of 
the boundary points t_j = t_ij when j = 1, ..., |X_i| and i = 1, ..., k. The calculation result 
will indicate whether the boundary point exists near the time t_j for each of the scene 
layers. The video signal processor 10 uses B_i(t_j) as a decision function when 
determining whether in the combined scene structure, the time t_j is a scene boundary. 

A simple example of the decision function is to count the true B_i(t_j) and, when the 
count is m or more, regard the time t_j as a scene boundary of the combined scene 
structure. Especially when m = 1, it means that the boundary points of all the scene 
layers are the boundary points of the final scene structure. On the other hand, when 
m = k, it means that a scene boundary regarded as common to all the scene layers is 
the boundary point of the combined scene structure. 

Thus, the video signal processor 10 can combine different scene layers into 
a single scene structure. 
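
The counting decision function can be sketched as follows, assuming each scene layer is given as a list of boundary times in seconds. The function names, the example times and the rule that a candidate lying within the tolerance of an already accepted boundary is skipped are illustrative assumptions, the last one added only so that nearly coincident boundaries from different layers are not output twice.

```python
def has_boundary_near(layer, t, tol):
    """B_i(t): true when the layer has a boundary within tol seconds of t."""
    return any(abs(t - b) <= tol for b in layer)

def combine_layers(layers, m, tol=0.5):
    """Keep a time t as a scene boundary when at least m layers agree on it."""
    candidates = sorted(t for layer in layers for t in layer)
    combined = []
    for t in candidates:
        if combined and t - combined[-1] <= tol:
            continue  # already represented by a nearby accepted boundary
        votes = sum(has_boundary_near(layer, t, tol) for layer in layers)
        if votes >= m:
            combined.append(t)
    return combined

# Hypothetical boundary times (seconds) for a video layer and an audio layer.
video_layer = [10.0, 25.0, 40.0]
audio_layer = [10.3, 40.2, 55.0]
union_like = combine_layers([video_layer, audio_layer], m=1)
common_only = combine_layers([video_layer, audio_layer], m=2)
```

With m = 1 every boundary of either layer survives, while with m = 2, the number of layers here, only boundaries found in both layers remain.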

As having been described in the foregoing, the video signal processor 10 
according to the present invention is to extract a scene structure. It has already been 
proved by many experiments that the signal processing method carried out by the 
video signal processor 10 can be used to extract a scene structure from video data 
having various contents such as TV dramas, movies, etc. 

The video signal processor 10 is fully automatic and can automatically determine 
an appropriate threshold correspondingly to a change in content of a video data 
without the necessity of any user's intervention to set the aforementioned dissimilarity 
threshold and temporal threshold. 

Further, the video signal processor 10 according to the present invention can be 
operated by the user without any prior knowledge of a semantics of a video data. 

Moreover, since the video signal processor 10 is very simple and computationally 
efficient, it can be applied in home electronic appliances such as a set-top box, 
digital video recorder, home server, etc. 

Also, the video signal processor 10 can provide a result of scene detection as a 
basis for a new high-level access for video browsing. Therefore, the video signal 
processor 10 permits an easy access to a video data, which is based on the content of 
the data, by visualizing the content of the video data using the high-level scene video 
structure, not any segments. For example, the video signal processor 10 displays a 
scene by which the user can quickly know the summary of a program and thus quickly 
find a part of the program in which he is interested. 

Furthermore, the video signal processor 10 can provide a result of scene 
detection as a basis for automatic outlining or summarizing of a video data. For a 
consistent summing-up, it is generally necessary to decompose a video data into 
reconstructible meaningful components, not to combine together random fragments 
of a video data. A scene detected by the video signal processor 10 serves as a basis for 
preparation of such a summary. 

It should be noted that the present invention is not limited to the embodiment 
having been described in the foregoing, but the features used for measurement of 
similarity between segments, for example, may of course be other than those having been 
described and may be appropriately modified without departing from the scope of the 
present invention defined later. 

Industrial Applicability 

As having been described in detail in the foregoing, the present invention 
provides the signal processing method for detecting and analyzing a pattern reflecting 
the semantics of the content of a signal, the method including steps of extracting, from 
a segment consisting of a sequence of consecutive frames forming together the signal, 
at least one feature which characterizes the properties of the segment; calculating, 
using the extracted feature, a criterion for measurement of a similarity between a pair 
of segments for every extracted feature and measuring a similarity between a pair of 
segments according to the similarity measurement criterion; and detecting, according 
to the feature and similarity measurement criterion, two of the segments, whose mutual 
time gap is within a predetermined temporal threshold and mutual dissimilarity is less 
than a predetermined dissimilarity threshold, and grouping the segments into a scene 
consisting of a sequence of temporally consecutive segments reflecting the semantics 
of the signal content. 

Therefore, the signal processing method according to the present invention can 
detect similar segments in a signal and group them into a scene, thereby permitting 
extraction of a higher-level structure than a segment. 

Also the present invention provides the video signal processor for detecting and 
analyzing a visual and/or audio pattern reflecting the semantics of the content of a 
supplied video signal, the apparatus including means for extracting, from a visual 
and/or audio segment consisting of a sequence of consecutive visual and/or audio 
frames forming together the video signal, at least one feature which characterizes the 
properties of the visual and/or audio segment; means for calculating, using the 
extracted feature, a criterion for measurement of a similarity between a pair of visual 
segments and/or audio segments for every extracted feature and measuring a similarity 
between a pair of visual segments and/or audio segments according to the similarity 
measurement criterion; and means for detecting, according to the feature and similarity 
measurement criterion, two of the visual segments and/or audio segments, whose 
mutual time gap is within a predetermined temporal threshold and mutual dissimilarity 
is less than a predetermined dissimilarity threshold, and grouping the visual segments 
and/or audio segments into a scene consisting of a sequence of temporally consecutive 
visual segments and/or audio segments reflecting the semantics of the video signal 
content. 

Therefore, the video signal processor according to the present invention can 
detect similar visual segments and/or audio segments in the video signal and group 
them for output as a scene, thereby permitting extraction of a higher-level video 
structure than a visual and/or audio segment. 

CLAIMS 

1. A signal processing method for detecting and analyzing a pattern reflecting the 
semantics of the content of a signal, the method comprising steps of: 

extracting, from a segment consisting of a sequence of consecutive frames 
forming together the signal, at least one feature which characterizes the properties of 
the segment; 

calculating, using the extracted feature, a criterion for measurement of a 
similarity between a pair of segments for every extracted feature and measuring a 
similarity between a pair of segments according to the similarity measurement 
criterion; and 

detecting, according to the feature and similarity measurement criterion, two of 
the segments, whose mutual time gap is within a predetermined temporal threshold and 
mutual dissimilarity is less than a predetermined dissimilarity threshold, and grouping 
the segments into a scene consisting of a sequence of temporally consecutive segments 
reflecting the semantics of the signal content. 

2. The method as set forth in Claim 1, wherein the signal is at least one of visual 
and audio signals included in a video data. 

3. The method as set forth in Claim 1, wherein at the feature extracting step, a 
single statistic central value of the plurality of features at different time points in a 
single segment is selected for extraction. 

4. The method as set forth in Claim 1, wherein a statistic value of the similarity 
between a plurality of segment pairs is used to determine the dissimilarity threshold. 

5. The method as set forth in Claim 1, wherein of the segments, more than at least 
one segment which could not have been grouped into a scene at the grouping step are 
grouped into a single scene. 

6. The method as set forth in Claim 1, wherein a result of scene detection from 
arbitrary features acquired at the grouping step and more than at least one result of 
scene detection for features different from the arbitrary ones, are combined together. 

7. The method as set forth in Claim 2, wherein more than at least one result of 
scene detection from the video signal acquired at the grouping step and more than at 
least one result of scene detection from the audio signal acquired at the grouping step, 
are combined together. 

8. A video signal processor for detecting and analyzing a visual and/or audio 
pattern reflecting the semantics of the content of a supplied video signal, the apparatus 
comprising: 

means for extracting, from a visual and/or audio segment consisting of a 
sequence of consecutive visual and/or audio frames forming together the video signal, 
at least one feature which characterizes the properties of the visual and/or audio 
segment; 

means for calculating, using the extracted feature, a criterion for measurement 
of a similarity between a pair of visual segments and/or audio segments for every 
extracted feature and measuring a similarity between a pair of visual segments and/or 
audio segments according to the similarity measurement criterion; and 

means for detecting, according to the feature and similarity measurement 
criterion, two of the visual segments and/or audio segments, whose mutual time gap 
is within a predetermined temporal threshold and mutual dissimilarity is less than a 
predetermined dissimilarity threshold, and grouping the visual segments and/or audio 
segments into a scene consisting of a sequence of temporally consecutive visual 
segments and/or audio segments reflecting the semantics of the video signal content. 

9. The apparatus as set forth in Claim 8, wherein the feature extracting means 
selects, for extraction, a single statistic central value of the plurality of features at 
different time points in a single visual and/or audio segment. 

10. The apparatus as set forth in Claim 8, wherein a statistic value of the similarity 
between a plurality of visual and/or audio segment pairs is used to determine the 
dissimilarity threshold. 

11. The apparatus as set forth in Claim 8, wherein of the visual and/or audio 
segments, more than at least one visual and/or audio segment which could not have 
been grouped into a scene by the grouping means are grouped into a single scene. 

12. The apparatus as set forth in Claim 8, wherein a result of scene detection for 
arbitrary features acquired by the grouping means and more than at least one result of 
scene detection for features different from the arbitrary ones, are combined together. 

13. The apparatus as set forth in Claim 8, wherein more than at least one result of 
scene detection from the visual signal of the video signal acquired by the grouping 
means and more than at least one result of scene detection from the audio signal of the 
video signal acquired by the grouping means, are combined together. 

ABSTRACT 

The video signal processor 10 includes a scene detector 16 which uses features 
extracted for visual segments and/or audio segments resulting from segmentation of an 
input stream of video data, and a criterion for measurement of similarity between 
visual and/or audio segment pairs, calculated for each of the features using the 
similarity measurement criterion, to detect two visual segments and/or audio segments 
whose time gap is within a predetermined temporal threshold and whose dissimilarity 
is less than a predetermined dissimilarity threshold and group the segments into a 
scene consisting of visual segments and/or audio segments reflecting the semantics of 
the video data content and temporally contiguous to each other. 